OCPBUGS-54188: Update Pod interactions with Topology Manager policies #95111
@amolnar-rh: This pull request references Jira Issue OCPBUGS-54188, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
🤖 Mon Jun 23 15:25:53 - Prow CI generated the docs preview: https://95111--ocpdocs-pr.netlify.app/openshift-enterprise/latest/post_installation_configuration/node-tasks.html
@amolnar-rh: The following test failed, say
some comments inside
@@ -32,9 +32,11 @@ spec:
 memory: "100Mi"
 ----

-If the selected policy is anything other than `none`, Topology Manager would not consider either of these `Pod` specifications.
+If the selected policy is anything other than `none`, Topology Manager would consider either of the `BestEffort` or the `Burstable` QoS class `Pod` specifications.
Not sure here. When the Topology Manager policy is not `none`, it will indeed try to align all pods, but for pods whose QoS class is not `Guaranteed`, all the alignment logic degrades into a no-op. So, yes, we will do all the dance, but the result will be "no pinning, no alignment".
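To make the point above concrete, here is a minimal Python sketch of how a pod's QoS class follows from its requests and limits, and why alignment only has an effect for `Guaranteed` pods. The function names `qos_class` and `alignment_takes_effect` are hypothetical, the resource values are illustrative, and quantities are compared as plain strings, so this is a simplified model rather than the kubelet's actual logic.

```python
from typing import Dict, List

Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Classify a pod from its containers' requests and limits (simplified).

    Assumption: quantities are compared as raw strings, so "1" and "1000m"
    would not be treated as equal here, unlike in Kubernetes.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit for that resource.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


def alignment_takes_effect(policy: str, containers: List[Container]) -> bool:
    """Per the review comment: alignment is a no-op unless the pod is Guaranteed."""
    return policy != "none" and qos_class(containers) == "Guaranteed"


# Hypothetical pods: no resources, memory-only with request < limit,
# and CPU plus memory with requests equal to limits.
best_effort_pod = [{}]
burstable_pod = [{"requests": {"memory": "100Mi"}, "limits": {"memory": "200Mi"}}]
guaranteed_pod = [{
    "requests": {"cpu": "2", "memory": "200Mi"},
    "limits": {"cpu": "2", "memory": "200Mi"},
}]

for pod in (best_effort_pod, burstable_pod, guaranteed_pod):
    print(qos_class(pod), alignment_takes_effect("single-numa-node", pod))
```

Under this model, only the third pod is affected by a policy other than `none`; the first two still pass through the Topology Manager but end up with no pinning and no alignment.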
@@ -32,9 +32,11 @@ spec:
 memory: "100Mi"
 ----

-If the selected policy is anything other than `none`, Topology Manager would not consider either of these `Pod` specifications.
+If the selected policy is anything other than `none`, Topology Manager would consider either of the `BestEffort` or the `Burstable` QoS class `Pod` specifications.
+When the Topology Manager policy is set to `none`, the relevant containers are pinned to any available CPU without considering NUMA affinity. This is the default behavior and does not optimize for performance-sensitive workloads.
We usually use "pinning" to mean "run on a precise set of resources", so I'm not sure the terminology is best here. "Pinned to anything" is something I don't see used much, but I'm also not a native English speaker.
What about:
"the relevant containers are assigned to run on any available set of CPUs..."
Or should we keep it vague and say "resources" instead of specifying CPU?
"the relevant containers are assigned to run on any available set of CPUs..." seems fine to me
-If the selected policy is anything other than `none`, Topology Manager would not consider either of these `Pod` specifications.
+If the selected policy is anything other than `none`, Topology Manager would consider either of the `BestEffort` or the `Burstable` QoS class `Pod` specifications.
+When the Topology Manager policy is set to `none`, the relevant containers are pinned to any available CPU without considering NUMA affinity. This is the default behavior and does not optimize for performance-sensitive workloads.
+Other values enable the use of topology awareness information from device plugins. The Topology Manager attempts to align the CPU, memory, and device allocations according to the topology of the node when the policy is set to other values than `none`. For more information about the available values, see _Additional resources_.
device plugins and core resources (cpu, memory)
@@ -53,6 +55,6 @@ spec:
 example.com/device: "1"
 ----

-Topology Manager would consider this pod. The Topology Manager would consult the hint providers, which are CPU Manager and Device Manager, to get topology hints for the pod.
+Topology Manager would consider this pod. The Topology Manager would consult the Hint Providers, which are CPU Manager and Device Manager, to get topology hints for the pod.
CPU Manager, Device Manager and Memory Manager
@@ -16,15 +16,12 @@ This is the default policy and does not perform any topology alignment.

 `best-effort` policy::

-For each container in a pod with the `best-effort` topology management policy, kubelet calls each Hint Provider to discover their resource
-availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.
+For each container in a pod with the `best-effort` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.
This is technically correct but maybe too low level. The observable behavior of the best-effort policy is that the kubelet will try to align all the required resources on a NUMA node, but if the allocation is impossible (not enough resources), the allocation will spill into other NUMA nodes unpredictably. The pod will always be admitted.
I tried to rephrase it. WDYT?
Kubelet tries to align all the required resources on a NUMA node according to the preferred NUMA node affinity for that container. Even if the allocation is not possible due to insufficient resources, the Topology Manager still admits the pod but the allocation is shared with other NUMA nodes.
The only reason I'm leaving out "unpredictably" is because I feel like we'd need to explain what that means exactly.
Your rephrasing seems fine to me, thanks
-For each container in a pod with the `restricted` topology management policy, kubelet calls each Hint Provider to discover their resource
-availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
-preferred, Topology Manager rejects this pod from the node, resulting in a pod in a `Terminated` state with a pod admission failure.
+For each container in a pod with the `restricted` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a `Terminated` state with a pod admission failure.
The observable behavior here is that the kubelet will determine the theoretical minimal number of NUMA nodes that can fulfill the request, and reject the admission if the actual allocation would take more than that number of NUMA nodes; otherwise the pod will go running.
What do you mean that the "pod will go running"? Do you mean that the pod is admitted and it will run/operate?
Except for that part, I rephrased it:
kubelet determines the theoretical minimum number of NUMA nodes that can fulfill the request. If the actual allocation requires more than that number of NUMA nodes, the Topology Manager rejects the admission, resulting in a pod in a `Terminated` state with a pod admission failure.
> What do you mean that the "pod will go running"? Do you mean that the pod is admitted and it will run/operate?

Yes, precisely.

 `single-numa-node` policy::

-For each container in a pod with the `single-numa-node` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.
+For each container in a pod with the `single-numa-node` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a `Terminated` state with a pod admission failure.
The observable behavior is that the kubelet will admit the pod iff all the resources required by the pod itself can be allocated on the same NUMA node. Arguably, it's the same as Restricted with the minimal number of NUMA nodes = 1.
PTAL:
kubelet admits the pod if all the resources required by the pod can be allocated on the same NUMA node. If a single NUMA node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a `Terminated` state with a pod admission failure.
LGTM
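For readers comparing the three policy descriptions agreed on in this thread, the observable admission behavior can be sketched as a small Python model. This is only an illustration under stated assumptions: it tracks a single resource (CPUs), the `NumaNode`, `min_nodes`, and `admit` names are hypothetical, and the "theoretical minimum" is computed from total capacity while the "actual" count is computed from what is currently free. It is not the kubelet implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NumaNode:
    capacity: int  # total CPUs on this NUMA node
    free: int      # CPUs currently unallocated


def min_nodes(amounts: List[int], request: int) -> Optional[int]:
    """Smallest number of NUMA nodes whose combined amount covers the request."""
    total = 0
    for count, amount in enumerate(sorted(amounts, reverse=True), start=1):
        total += amount
        if total >= request:
            return count
    return None  # the request does not fit at all


def admit(policy: str, nodes: List[NumaNode], request: int) -> bool:
    """Return True if the pod would be admitted under the simplified model."""
    theoretical = min_nodes([n.capacity for n in nodes], request)
    actual = min_nodes([n.free for n in nodes], request)
    if theoretical is None or actual is None:
        # Assumption: the scheduler already ruled out nodes that are too small.
        raise ValueError("request does not fit on this node")
    if policy in ("none", "best-effort"):
        # best-effort tries to align, but always admits the pod.
        return True
    if policy == "restricted":
        # Reject if the allocation needs more NUMA nodes than the theoretical minimum.
        return actual <= theoretical
    if policy == "single-numa-node":
        # Admit only if everything fits on one NUMA node.
        return actual == 1
    raise ValueError(f"unknown policy: {policy}")


# Two NUMA nodes with 8 CPUs each, partially used.
fragmented = [NumaNode(capacity=8, free=2), NumaNode(capacity=8, free=3)]
for policy in ("best-effort", "restricted", "single-numa-node"):
    print(policy, admit(policy, fragmented, request=4))
```

In the fragmented example, a 4-CPU request could theoretically fit on one NUMA node but currently needs two, so only `best-effort` admits it. A 10-CPU request on an idle node of the same shape needs two NUMA nodes either way, so `restricted` admits it while `single-numa-node` does not.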
Version(s): 4.12, 4.14, 4.15, 4.16, 4.17, 4.18, 4.19, 4.20
Issue: https://issues.redhat.com/browse/OCPBUGS-54188
Link to docs preview:
QE review:
Additional information: